9/7/2017

AR12673

Tom Bogdan

  • Director, Space Weather Prediction, NOAA 2006 - 2012
  • President, University Corporation for Atmospheric Research, 2012 - 2015

Big Picture

Tom Bogdan

Big Picture

Customer Impacts

  • Aviation
    • HF Communications used for Air Traffic Control
    • GPS enabled approach and landing in use at major airports
    • Radiation doses to flight crews and passengers
  • Precision GPS
    • Ubiquitous in construction and agriculture
    • Transportation network monitoring
    • Timing/sync for telecom, power grid, and financial systems
  • Satellites
    • Telecommunications relay for the planet
    • Surveillance and reconnaissance
    • National security assets on orbit
    • On-demand entertainment
  • Power Grid

Outline

  1. Frontier Development Lab
  2. Heliophysics
  3. The Data
  4. Deep Learning
  5. Going Forward

Frontier Development Lab

FDL

NASA Frontier Development Lab is an AI R&D accelerator that tackles knowledge gaps useful to the space program. The program is an intense, concentrated 8-week study of topics important not only to NASA but also to humanity's future.

Space Camp… 25 Years Later

NASA Ames

IBM's Partnership

ME!!!

Also, 16 of these for 2 months

5 Teams

  • Radar 3D Shape Modeling
  • Long Period Comets
  • Lunar Water and Volatiles
  • Solar Storm Prediction
  • Solar-Terrestrial Interactions

5 Teams

  • Radar 3D Shape Modeling
  • Long Period Comets
  • Lunar Water and Volatiles
  • Solar Storm Prediction
  • Solar-Terrestrial Interactions

XKCD 1831

Here to Help

Heliophysics

XKCD 1851

Magnetohydrodynamics

YouTube Video

Big Picture

Tom Bogdan

The Data

X-ray Flux

GOES 15

Teamwork!

flux <- read.csv("data/Flux_2010_2015.csv")
head(flux)
##                  Time         Flux Count
## 1 2010-01-01 00:02:00 5.403997e-08    58
## 2 2010-01-01 00:04:00 5.346478e-08    58
## 3 2010-01-01 00:06:00 5.423044e-08    59
## 4 2010-01-01 00:08:00 5.692712e-08    59
## 5 2010-01-01 00:10:00 5.496905e-08    58
## 6 2010-01-01 00:12:00 6.079832e-08    59

You need a script like the one below to pull down the CSVs.

Take the max over some interval.
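A minimal sketch of that aggregation, using a few synthetic rows shaped like the flux table above (the values here are hypothetical; the real CSV has 2-minute cadence over 2010-2015):

```r
# Synthetic rows mimicking the GOES flux format above (hypothetical values)
flux <- data.frame(
  Time = as.POSIXct(c("2010-01-01 00:02:00", "2010-01-01 00:04:00",
                      "2010-01-01 01:02:00", "2010-01-01 01:04:00"),
                    tz = "UTC"),
  Flux = c(5.4e-08, 6.1e-08, 5.3e-08, 5.9e-08))
# Bin each observation by hour, then keep the maximum flux in each bin
flux$hour <- format(flux$Time, "%Y-%m-%d %H")
flux.max <- aggregate(Flux ~ hour, data = flux, FUN = max)
flux.max
```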

X-ray Flux

X-ray Flux Log

Solar Dynamics Observatory (SDO)

Atmospheric Imaging Assembly (AIA)

  • Led from the Lockheed Martin Solar and Astrophysics Laboratory (LMSAL)
  • Provides continuous full-disk observations of the solar chromosphere and corona in 8 channels/temperatures
  • Image stream provides unprecedented views of the various phenomena that occur within the evolving solar outer atmosphere
    • 12-second cadence
    • 4096 by 4096 pixels
    • 2010 - Present

Petabytes of data

  • Image stream provides unprecedented views of the various phenomena that occur within the evolving solar outer atmosphere
    • 12-second cadence -> 12-minute cadence
    • 4096 by 4096 pixels -> 1024 by 1024 pixels
    • 2010 - Present -> 2014 (Jan-Jun)

From PBs to GBs…
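The back-of-envelope reduction from those choices (a sketch; this ignores the shorter time range and any compression):

```r
cadence.factor <- (12 * 60) / 12   # one image per 12 s -> per 12 min: 60x fewer
pixel.factor <- 4096^2 / 1024^2    # full res -> 1024^2: 16x fewer pixels
cadence.factor * pixel.factor      # ~960x less data per unit time
```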

Downloading that data

create.file.list <- function(which.months = 1:6){
  base.url <- "http://jsoc.stanford.edu/data/aia/synoptic/2014/"
  months <- sprintf("%02.f", 1:12)[which.months]
  max.days <- c(31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)[which.months]
  hours <- sprintf("%02.f", 0:23)
  minutes <- sprintf("%02.f", c(00, 12, 24, 36, 48))
  wl <- c("0094", "0131", "0171", "0193", "0211", "0304", "0335", "1600")
  target.file.list <- target.urls <- rep("", sum(max.days) * 5 * 24 * 8)
  ind <- 1
  for(i in 1:length(months)){
    for(j in 1:(max.days[i])){
      for(k in 1:length(hours)){
        for(l in 1:length(minutes)){
          for(m in 1:length(wl)){
            aia.url <- paste0(base.url, months[i], "/",
                              sprintf("%02.f", j), "/H", hours[k], "00/")
            aia.file <- paste0("AIA2014", months[i],
                               sprintf("%02.f", j), "_", hours[k], minutes[l], "_", wl[m], ".fits")
            target.file.list[ind] <- aia.file
            target.urls[ind] <- paste0(aia.url, aia.file)
            ind <- ind + 1
          }
        }
      }
    }
  }
  list(target.file.list, target.urls)
}

Downloading that data

download.aia.2014 <- function(target.dir = "data/AIA2014/",
                              target.file.list, target.urls){
  # Skip files that have already been downloaded
  lf <- list.files(target.dir)
  todo.list <- which(!(target.file.list %in% lf))

  # Download only the missing files; seq_along() handles an empty todo.list
  for(i in seq_along(todo.list)){
    try(download.file(url = target.urls[todo.list[i]],
                      destfile = paste0(target.dir,
                                        target.file.list[todo.list[i]]),
                      method = "libcurl"),
        silent = TRUE)
  }
}

Downloading that data

We'll only look at the first two samples of data, i.e., the first 16 FITS files (8 channels per time step).

temp.file.list <- create.file.list(which.months = 1)
download.aia.2014(target.dir = "data/AIA2014/",
                  target.file.list = temp.file.list[[1]][1:16],
                  target.urls = temp.file.list[[2]][1:16])

Generate all channel index

aia.index.allchannels <- function(target.dir = "data/",
                                  target.file.dir = "data/AIA2014/",
                                  temp.date = Sys.Date()){
  lf <- list.files(target.file.dir)
  time.stamp <- substr(lf, 1, 16)
  wave.length <- substr(lf, 18, 21)
  time.wave.tbl <- table(time.stamp, wave.length)
  time.wave.tbl8 <- time.wave.tbl[, 1:8]
  num.channels <- apply(time.wave.tbl8, 1, sum)
  all.channels <- num.channels == 8
  all.channels.times <- rownames(time.wave.tbl8)[all.channels]
  year <- as.numeric(substr(all.channels.times, 4, 7))
  month <- as.numeric(substr(all.channels.times, 8, 9))
  day <- as.numeric(substr(all.channels.times, 10, 11))
  hour <- as.numeric(substr(all.channels.times, 13, 14))
  minute <- as.numeric(substr(all.channels.times, 15, 16))
  AIAindex.allchannels <- cbind(year, month, day, hour, minute)
  AIAindex.allchannels.POSIX <- as.POSIXct(paste(
    paste(AIAindex.allchannels[, 1],
          sprintf("%02.f", AIAindex.allchannels[, 2]),
          sprintf("%02.f", AIAindex.allchannels[, 3]), sep = "-"),
    paste(sprintf("%02.f", AIAindex.allchannels[, 4]),
          sprintf("%02.f", AIAindex.allchannels[, 5]), "00", sep = ":"),
    sep = " "), tz = "UTC")
  write.csv(AIAindex.allchannels.POSIX,
            paste0(target.dir, "AIA_index_allChannels_", temp.date, ".csv"),
            row.names = FALSE)
}

On file type choices

Generate Usable Data

library(doParallel)
library(foreach)
library(feather)
library(reticulate)
registerDoParallel(cores=detectCores())
astropy <- import("astropy")
aia.index.allchannels(target.dir = "data/", target.file.dir = "data/AIA2014/",
                      temp.date = "2017-09-07")

  1. Convert to Feather
  2. Take the mean and sd of the logged values
  3. Create a PCA model
  4. Apply the mean, sd, and PCA
  5. Take the mean and sd of the PCA-ed data

Generate Usable Data - Feather

lf <- list.files("data/AIA2014/")
target.dir = "data/"
temp.date = "2017-09-07"
all.channel.index.POSIX <- as.POSIXct(read.csv(paste0(target.dir,
                                                      "AIA_index_allChannels_",
                                                      temp.date,".csv"),
                                               stringsAsFactors = FALSE)[[1]],
                                      tz = "UTC", format = "%Y-%m-%d %H:%M:%S")

all.channel.index <- outer(format(all.channel.index.POSIX, "%Y%m%d_%H%M"),
                           c("_0094", "_0131", "_0171", "_0193",
                             "_0211", "_0304", "_0335", "_1600"),
                           paste, sep = "")


foreach(i = 1:nrow(all.channel.index)) %dopar% {
  big.mat <- matrix(NA, nrow = 1024*1024, ncol = 8)
  for(j in 1:8){
    temp <- astropy$io$fits$open(paste0("data/AIA2014/AIA", all.channel.index[i, j], ".fits"))
    temp$verify("fix")
    # Pull the exposure time out of the FITS header
    exp.time <- as.numeric(substring(strsplit(as.character(temp[[1]]$header),
                                              "EXPTIME")[[1]][2], 4, 12))
    # Floor negative pixels at 0 and shift by 1 so logs stay finite later
    temp.mat <- temp[[1]]$data
    temp.mat[temp.mat <= 0] <- 0
    temp.mat <- temp.mat + 1
    # Normalize by exposure time and flatten into column j
    big.mat[, j] <- c(t(temp.mat / exp.time))
  }
  write_feather(as.data.frame(big.mat), paste0("data/FeatherAIA2014/AIA",
                       format(all.channel.index.POSIX[i], "%Y%m%d_%H%M"),
                       ".feather"))
}

Generate Usable Data - Data Exploration

lf.train <- list.files("data/FeatherAIA2014/")
temp.mat <- read_feather(paste0("data/FeatherAIA2014/", lf.train[1]))
par(mfrow=c(2,4), mar=c(1,1,1,1), oma=c(0.5,0.5,0.5,0.5))
for(i in 1:8){
  hist(as.matrix(temp.mat)[, i], main = "")
}

Generate Usable Data - Data Exploration

par(mfrow=c(2,4), mar=c(1,1,1,1), oma=c(1,1,0.5,0.5))
for(i in 1:8){
  hist(log(as.matrix(temp.mat))[, i], main = "")
}

Generate Usable Data - Untransformed Data

par(mfrow=c(2,4), mar=c(1,1,1,1), oma=c(0.5,0.5,0.5,0.5))
temp.mat2 <- scale(temp.mat)
image(matrix(as.matrix(temp.mat2)[, 1], nrow = 1024), zlim = c(-2, 2), xaxt = "n", yaxt = "n")

Generate Usable Data - Logged/Scaled Data

temp.mat2 <- scale(log(temp.mat))
image(matrix(as.matrix(temp.mat2)[, 1], nrow = 1024), zlim = c(-2, 2),
      xaxt = "n", yaxt = "n")

Generate Usable Data - PCA'ed Data

pca.mod <- prcomp(temp.mat2)
temp.mat3 <- scale(predict(pca.mod, temp.mat2))
image(matrix(as.matrix(temp.mat3)[, 1], nrow = 1024), zlim = c(-2, 2),
      xaxt = "n", yaxt = "n")
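To see how much of the 8-channel variance the leading components carry, `summary(prcomp(...))` reports cumulative proportions. A minimal sketch on synthetic 8-column data (the real `temp.mat2` is 1024^2 x 8; the correlation induced here is purely illustrative):

```r
set.seed(42)
# Synthetic stand-in for the scaled 8-channel matrix
m <- matrix(rnorm(1000 * 8), ncol = 8)
m[, 2] <- m[, 1] + rnorm(1000, sd = 0.1)  # correlate two "channels"
pca.demo <- prcomp(scale(m))
# Cumulative proportion of variance in the first three components
summary(pca.demo)$importance["Cumulative Proportion", 1:3]
```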

Generate Usable Data - PCA'ed Data

Generating RGB images using the imager library

library(imager)
prep.img <- function(vec, expo = 1){
  mat <- matrix(vec, nrow = 1024)
  temp.mat <- mat - min(mat)
  temp.mat <- (temp.mat / max(temp.mat))[, 1024:1]
  temp.mat <- temp.mat ^ expo
  c(temp.mat)
}
# expo adjusts the shape of the distribution -> adjusts RGB coloring

col <- cbind(r = prep.img(temp.mat3[, 1], .5),
             g = prep.img(temp.mat3[, 2], .5),
             b = prep.img(temp.mat3[, 3], 3))
dim(col) <- c(1024, 1024, 1, 3)
plot(cimg(col), axes = FALSE)

Generate Usable Data - PCA'ed Data

Generating RGB images using the imager library

Generate Usable Data - Scale

mean.log <- foreach(i = 1:length(lf.train), .combine = rbind) %dopar% {
  file.name <- paste0("data/FeatherAIA2014/AIA",
                      format(all.channel.index.POSIX[i], "%Y%m%d_%H%M"),
                      ".feather")

  temp.mat <- read_feather(file.name)
  apply(log(temp.mat), 2, mean)
}

write.csv(mean.log, "data/MeanPostLog.csv", row.names = FALSE)
mean.vec <- apply(mean.log, 2, mean)

Generate Usable Data - Scale

sd.log <- foreach(i = 1:length(lf.train), .combine = rbind) %dopar% {
  file.name <- paste0("data/FeatherAIA2014/AIA",
                      format(all.channel.index.POSIX[i], "%Y%m%d_%H%M"),
                      ".feather")

  temp.mat <- read_feather(file.name)
  # sd of the logged values, centered on the global log-means
  apply(t(t(log(temp.mat)) - mean.vec), 2, sd)
}

write.csv(sd.log, "data/SdPostLog.csv", row.names = FALSE)
sd.vec <- sqrt(apply(sd.log ^ 2, 2, mean))
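The last line pools per-file sds by root-mean-square. A quick synthetic check that this recovers the overall sd when the files are equal-sized and share a common mean (it underestimates when the per-file means differ):

```r
set.seed(1)
x <- matrix(rnorm(4 * 1000), nrow = 4)      # 4 "files" of 1000 values each
sd.pooled <- sqrt(mean(apply(x, 1, sd)^2))  # the pooling rule above
sd.total <- sd(c(x))                        # sd computed over everything at once
c(pooled = sd.pooled, total = sd.total)     # the two agree closely
```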

Generate Usable Data - CSV

lf <- list.files("data/FeatherAIA2014/")
foreach(i = 1:length(lf)) %dopar% {
  temp.mat <- as.matrix(read_feather(paste0("data/FeatherAIA2014/", lf[i])))
  # Log, center, and scale each channel, then flatten all 8 into one row
  temp.mat <- round(c(t((t(log(temp.mat)) - mean.vec) / sd.vec)[, 1:8]), 2)

  write.table(matrix(temp.mat, nrow = 1),
              paste0("data/CSVAIA2014/", substring(lf[i], 1, 16), ".csv"),
              sep = ',', row.names = FALSE, col.names = FALSE)
}
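The `c(t(...))` flattening above lays out each channel's pixels contiguously. A 3-pixel, 2-channel toy example (hypothetical values):

```r
x <- matrix(1:6, nrow = 2)  # rows = channels, cols = pixels, like t(temp.mat)
c(t(x))                     # channel 1's pixels (1, 3, 5), then channel 2's (2, 4, 6)
```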

Generate One Big CSV

lf <- list.files("data/CSVAIA2014/", full.names = TRUE)

cat.str <- paste0("cat ", paste(lf, collapse = " "),
                  " > data/AIA2014_Big.csv")
cat.str
system(cat.str)
## [1] "cat data/CSVAIA2014/AIA20140101_0000.csv data/CSVAIA2014/AIA20140101_0012.csv > data/AIA2014_Big.csv"

Deep Learning

XKCD 1838

Machine Learning

Deep Learning

Deep Learning

“There has been a great deal of hype surrounding neural networks, making them seem magical and mysterious…”

  • The Elements of Statistical Learning

Approaches

Frameworks

  • Tensorflow
    • Google
    • Python -> C++
    • In R:
      • Supported by RStudio
      • R -> Python -> C++
  • MXNet
    • Amazon
    • Almost anything -> C++
    • In R:
      • Supported by the Rcpp Core Team
      • R -> C++

Frameworks

Tensorflow

  • Seemed to have a memory leak
    • Memory management on the GPU is important
  • Couldn't get the iterator/data loading function to work
  • My team used Keras (in Python) to access Tensorflow as a backend

MXNet

  • No memory leak
    • 100% GPU usage!
  • Iterator function worked!
  • Still no success :|

Engineering Considerations

Data

I've talked enough about data, but clearly you need lots of storage.

Hardware

Hardware

  • GPUs are important.
  • 10x speed-up over CPUs (54 threads)
  • Rich man's game :/
  • Suggest getting a cheap GPU or high end CPU to develop
    • Move it over to the cloud (Softlayer, AWS) for scaling up
    • Going to want your data in cloud storage for that

Going Forward

Open Problem!

  • My team was able to sort out many of the data issues.
  • Our best networks either predicted constant values or overfit.
  • Still no success :|
    • 1024x1024 images are 16x bigger than ImageNet's
    • 20k images are 50x fewer than ImageNet's
    • Finding the correct architecture is the key
  • The Earth (and your entertainment) depends on it.
  • The perfect opportunity to try deep learning in R.

The End